
Conversation

@msaroufim (Member) commented Jan 29, 2026

Don't bother reviewing!

Copilot AI review requested due to automatic review settings January 29, 2026 18:27
@github-actions

Coverage report

| File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing |
| --- | --- | --- | --- | --- | --- |
| src/libkernelbot/consts.py | | | | | |
| Project Total | | | | | |

This report was generated by python-coverage-comment-action

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a proof-of-concept Buildkite integration for running GPU kernel submissions on vendor-donated hardware. Buildkite provides better isolation guarantees compared to GitHub Actions by binding each agent to a single GPU with CPU/RAM limits enforced via systemd cgroups.

Changes:

  • Adds BuildkiteLauncher to create and monitor Buildkite builds via API (see the sketch after this list)
  • Implements buildkite-runner.py script that executes inside Docker containers on Buildkite agents
  • Provides infrastructure scripts for vendors to set up isolated GPU agents
  • Includes comprehensive documentation and testing guides
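
For orientation, a rough sketch of the create-and-poll flow described above (the endpoints are Buildkite's public REST API; the function name, payload fields, and polling interval are illustrative assumptions, not the PR's actual code):

```python
import time

import requests

BUILDKITE_API = "https://api.buildkite.com/v2"


def create_and_wait(org: str, pipeline: str, token: str, env: dict) -> dict:
    """Create a Buildkite build, then poll until it reaches a terminal state."""
    headers = {"Authorization": f"Bearer {token}"}
    builds_url = f"{BUILDKITE_API}/organizations/{org}/pipelines/{pipeline}/builds"
    resp = requests.post(
        builds_url,
        headers=headers,
        # "env" is how per-build variables (e.g. the GPU queue) reach the agent.
        json={"commit": "HEAD", "branch": "main", "env": env},
    )
    resp.raise_for_status()
    number = resp.json()["number"]

    while True:
        build = requests.get(f"{builds_url}/{number}", headers=headers).json()
        if build["state"] in ("passed", "failed", "canceled", "blocked"):
            return build
        time.sleep(5)
```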

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/libkernelbot/launchers/buildkite.py | New launcher that creates Buildkite builds, polls for completion, and downloads artifacts |
| src/runners/buildkite-runner.py | Runner script that decompresses payload and executes submissions in Buildkite agents |
| src/libkernelbot/consts.py | Adds BuildkiteGPU enum and BUILDKITE scheduler type |
| src/libkernelbot/launchers/__init__.py | Exports BuildkiteLauncher |
| src/kernelbot/main.py | Registers BuildkiteLauncher if API token is configured |
| src/kernelbot/env.py | Adds Buildkite environment variables (optional) |
| tests/test_buildkite.py | Unit tests for BuildkiteLauncher, GPU enums, and runner script |
| docker/kernelbot-runner/Dockerfile | Container image with CUDA, PyTorch, and kernelbot dependencies |
| docker/kernelbot-runner/requirements-runner.txt | Python dependencies for the runner image |
| .buildkite/pipeline.yml | Buildkite pipeline configuration for running submissions in Docker |
| .github/workflows/build-runner-image.yml | Workflow to automatically build and push runner image |
| scripts/buildkite/setup-agent.sh | Script for vendors to configure per-GPU agents with resource isolation |
| scripts/buildkite/update-image.sh | Script for vendors to update runner image and restart agents |
| docs/docs/vendor-onboarding/buildkite.md | Vendor onboarding guide with setup and troubleshooting instructions |
| docs/docs/vendor-onboarding/testing-guide.md | Comprehensive testing guide covering unit, container, agent, and E2E tests |
| docs/BUILDKITE_POC.md | POC documentation describing architecture, implementation, and next steps |
| src/discord_cluster_manager.egg-info/* | Auto-generated package metadata files |


Comment on lines +260 to +266
```python
profile_artifact = None
for artifact in artifacts:
    if artifact.get("filename") == "result.json":
        result_artifact = artifact
    elif artifact.get("path", "").startswith("profile_data/"):
        profile_artifact = artifact
```

Copilot AI commented Jan 29, 2026

The profile_artifact variable is set whenever an artifact's path starts with "profile_data/", so each match overwrites the previous one. If there are multiple profile artifacts (which is possible, since the pipeline config specifies "profile_data/**/*"), only the last match is kept and its download URL will be added to all ProfileResult objects. This could lead to incorrect profile data associations. Consider either collecting all profile artifacts or being more specific about which profile artifact to use.

Suggested change

```diff
-profile_artifact = None
-for artifact in artifacts:
-    if artifact.get("filename") == "result.json":
-        result_artifact = artifact
-    elif artifact.get("path", "").startswith("profile_data/"):
-        profile_artifact = artifact
+profile_artifacts = []
+for artifact in artifacts:
+    if artifact.get("filename") == "result.json":
+        result_artifact = artifact
+    elif artifact.get("path", "").startswith("profile_data/"):
+        profile_artifacts.append(artifact)
+
+# Determine which profile artifact to use, if any
+profile_artifact = None
+if len(profile_artifacts) == 1:
+    profile_artifact = profile_artifacts[0]
+elif len(profile_artifacts) > 1:
+    logger.warning(
+        "Multiple profile artifacts found; not associating a specific one with runs."
+    )
```

Comment on lines +27 to +37
```python
class BuildkiteGPU(Enum):
    # Queue naming: {vendor}-{gpu_type}
    # Buildkite agents use tags like queue=nvidia-h100-0 for per-GPU routing
    # The enum value is the queue prefix; agents append -N for specific GPU index
    NVIDIA_H100 = "nvidia-h100"
    NVIDIA_B200 = "nvidia-b200"
    NVIDIA_A100 = "nvidia-a100"
    AMD_MI300 = "amd-mi300"
    AMD_MI250 = "amd-mi250"
    GOOGLE_TPU = "google-tpu"
    NEBIUS_H100 = "nebius-h100"
```
Copilot AI commented Jan 29, 2026

The BuildkiteGPU enum includes several GPU types (NVIDIA_H100, NVIDIA_B200, NVIDIA_A100, AMD_MI300, AMD_MI250, GOOGLE_TPU, NEBIUS_H100), but the GPU_TO_SM mapping dictionary (lines 127-138 in consts.py) is missing entries for these new GPU types. When _get_arch is called with any of these new Buildkite GPU types, it will raise a KeyError. Add mappings for at least the NVIDIA GPUs: NVIDIA_H100 should map to "90a", NVIDIA_B200 to "100", NVIDIA_A100 to "80", and NEBIUS_H100 to "90a".
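
A minimal sketch of those additions (the key format is an assumption about how GPU_TO_SM is keyed in consts.py, which this diff does not show; the values are the arch strings suggested above):

```python
# Hypothetical additions to GPU_TO_SM; keys assumed to match the
# existing entries' naming, values are the suggested SM/arch strings.
GPU_TO_SM.update({
    "NVIDIA_H100": "90a",
    "NVIDIA_B200": "100",
    "NVIDIA_A100": "80",
    "NEBIUS_H100": "90a",
})
```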

```yaml
agents:
  # Route to agent with matching queue tag
  queue: "${GPU_QUEUE}"
timeout_in_minutes: 15
```
Copilot AI commented Jan 29, 2026

The pipeline timeout_in_minutes is hardcoded to 15 minutes, but BuildkiteLauncher calculates timeouts dynamically from the config (the get_timeout function) and adds TIMEOUT_BUFFER_MINUTES. For submissions with longer configured timeouts, the Buildkite pipeline will time out before the launcher's timeout logic can cancel the build. The pipeline timeout should either be set to a higher value or made configurable via an environment variable to align with the launcher's timeout calculation.

Suggested change

```diff
-timeout_in_minutes: 15
+# Pipeline timeout is configurable to align with BuildkiteLauncher timeouts.
+timeout_in_minutes: ${PIPELINE_TIMEOUT_MINUTES:-60}
```

```python
return {
    "lang": "py",
    "mode": "test",
    "files": {"main.py": "print('hello')"},
```
Copilot AI commented Jan 29, 2026

The test config uses "files" but should use "sources" to match the expected config format used by run_config. Looking at other tests in the codebase (e.g., tests/test_backend.py and tests/test_task.py), the config passed to launchers should have "sources" as the key containing file mappings, not "files". This will cause the runner script to fail when it calls run_config with this payload.

Suggested change

```diff
-    "files": {"main.py": "print('hello')"},
+    "sources": {"main.py": "print('hello')"},
```

Comment on lines +71 to +76
```python
self.token = token
self._headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}
```

Copilot AI commented Jan 29, 2026

The Buildkite API token is stored in plaintext in an instance variable. While this is consistent with how the GitHub launcher handles tokens, consider that tokens could be exposed in logs or error messages. Ensure that __repr__ or __str__ methods don't accidentally expose this token, and that error messages sanitize sensitive information.

Suggested change

```diff
-        self.token = token
-        self._headers = {
-            "Authorization": f"Bearer {token}",
-            "Content-Type": "application/json",
-        }
+        # Store the sensitive token in a private attribute to reduce accidental exposure.
+        self._token = token
+        self._headers = {
+            "Authorization": f"Bearer {token}",
+            "Content-Type": "application/json",
+        }
+
+    @property
+    def token(self) -> str:
+        """Read-only access to the Buildkite API token.
+
+        Kept as a property so external code can still access `launcher.token`,
+        while the actual storage remains in a private attribute.
+        """
+        return self._token
+
+    def __repr__(self) -> str:
+        """Representation of the launcher that avoids exposing the API token."""
+        return (
+            f"{self.__class__.__name__}(org={self.org!r}, "
+            f"pipeline={self.pipeline!r})"
+        )
```

Comment on lines 222 to 224
if state != "passed":
logger.warning(f"Build #{build_number} finished with state: {state}")
return
Copilot AI commented Jan 29, 2026

The _wait_for_completion method returns successfully for builds in states "failed", "canceled", or "blocked", not just "passed". This means the method will return normally even when the build didn't succeed, and then the code proceeds to download artifacts. A failed or canceled build may not have artifacts, which could lead to confusing error messages. Consider checking the state after _wait_for_completion returns and returning an appropriate error FullResult for non-passed states before attempting artifact download.

Suggested change

```diff
-if state != "passed":
-    logger.warning(f"Build #{build_number} finished with state: {state}")
-return
+if state == "passed":
+    return
+logger.warning(f"Build #{build_number} finished with state: {state}")
+raise RuntimeError(
+    f"Build #{build_number} finished with non-passed state: {state}"
+)
```

Comment on lines 31 to 46
```python
def main():
    # Get payload from environment variable
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")

    # Decompress and parse config
    payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
    config = json.loads(payload)

    # Run the submission
    result = run_config(config)

    # Write result to file for artifact upload
    result_dict = asdict(result)
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```
Copilot AI commented Jan 29, 2026

The buildkite-runner.py main function does not wrap the entire execution in a try-except block. If run_config raises an exception (e.g., due to an invalid config or runtime error), the exception will propagate and no result.json will be created. This will cause the artifact download in BuildkiteLauncher to fail with a confusing "Could not find result.json artifact" error rather than showing the actual error. Consider wrapping the execution in try-except and writing an error result to result.json on failure.
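
One shape the suggested guard could take (a sketch reusing the names from the excerpt above; the fallback error payload is a hypothetical shape, not the project's FullResult schema):

```python
def main():
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")

    payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
    config = json.loads(payload)

    try:
        result_dict = asdict(run_config(config))
    except Exception as e:
        # Write an error result so the launcher still finds result.json
        # instead of failing with "Could not find result.json artifact".
        result_dict = {"success": False, "error": f"{type(e).__name__}: {e}"}
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```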

```python
mock_download.return_value = MagicMock(success=True)

reporter = MockProgressReporter()
result = await launcher.run_submission(mock_config, gpu_type, reporter)
```
Copilot AI commented Jan 29, 2026

Variable result is not used.

Suggested change

```diff
 result = await launcher.run_submission(mock_config, gpu_type, reporter)
+assert result.success is True
```
